NOTE: You must have Python Twitter Tools installed on the machine to run this script. You can install it by running the cell below (change the cell type in the toolbar above from Raw NBConvert to Code). You may need to use "! sudo easy_install twitter".
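If easy_install gives you trouble, pip should work as well; the package is published on PyPI under the name twitter:
In [ ]:
!pip install twitter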
First we import the libraries we'll be using. Note that this notebook is written for Python 2: it relies on cPickle here and on urllib2 and httplib further down.
In [1]:
from twitter import *
import csv, json
import cPickle as pickle
The OAuth credentials needed below can be found in Twitter's application manager. Create a new app if you haven't already. Once the app has been created, you'll find the necessary information under "Keys and Access Tokens". You may need to generate the access tokens yourself.
In [15]:
# Twitter OAuth Credentials
consumer_key = '' # Consumer Key (API Key)
consumer_secret = '' # Consumer Secret (API Secret)
access_token = '' # Access Token
access_secret = '' # Access Token Secret
In [3]:
t = Twitter(auth=OAuth(access_token, access_secret, consumer_key, consumer_secret))
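As a quick sanity check (not part of the original flow), account/verify_credentials returns your own profile when authentication succeeds, so it's an easy way to confirm the keys were pasted correctly:
In [ ]:
# Raises a TwitterHTTPError if the credentials are invalid
me = t.account.verify_credentials()
print(me['screen_name'])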
In [24]:
jsonpath = '' # Path to the JSON file where retrieved tweets go
picklepath = '' # Path to the pickle file where retrieved tweets go
In [11]:
usernames = ['UNICEFIndia','satyamevjayate','aamir_khan'] # Twitter handles whose timelines will be fetched
The following helper function, which retries failed Twitter API calls with backoff and rate-limit handling, is adapted from Matthew A. Russell's brilliant "Mining the Social Web, 2nd Edition" (O'Reilly, 2013).
In [6]:
import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine

def make_twitter_request(t_func, max_errors=10, *args, **kw):

    # A nested helper function that handles common HTTPErrors. Returns an updated
    # value for wait_period if the problem is a 500-level error. Blocks until the
    # rate limit is reset if it's a rate-limiting issue (429 error). Returns None
    # for 401 and 404 errors, which require special handling by the caller.
    def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):

        if wait_period > 3600: # Seconds
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e

        # See https://dev.twitter.com/docs/error-codes-responses for common codes

        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        elif e.e.code == 429:
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying in 15 minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e # Caller must handle the rate limiting issue
        elif e.e.code in (500, 502, 503, 504):
            print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        else:
            raise e

    # End of nested helper function

    wait_period = 2
    error_count = 0

    while True:
        try:
            return t_func(*args, **kw)
        except TwitterHTTPError as e: # Available via the earlier "from twitter import *"
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
        except URLError as e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, 'URLError encountered. Continuing.'
            if error_count > max_errors:
                print >> sys.stderr, 'Too many consecutive errors...bailing out.'
                raise
        except BadStatusLine as e:
            error_count += 1
            time.sleep(wait_period)
            wait_period *= 1.5
            print >> sys.stderr, 'BadStatusLine encountered. Continuing.'
            if error_count > max_errors:
                print >> sys.stderr, 'Too many consecutive errors...bailing out.'
                raise
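make_twitter_request will wrap any endpoint exposed by the Twitter object. As a small illustration (the handle is just one of our examples), here it wraps GET users/show:
In [ ]:
# Fetch a single user's profile with automatic retry/backoff
profile = make_twitter_request(t.users.show, screen_name='UNICEFIndia')
if profile is not None: # None signals a 401 or 404 was encountered
    print(profile['followers_count'])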
The next function, which pages through a user's timeline, is also adapted from "Mining the Social Web, 2nd Edition".
In [16]:
def get_user_tweets(t, screen_name=None, user_id=None, max_results=1000):

    assert (screen_name is not None) != (user_id is not None), \
        "Must have screen_name or user_id, but not both"

    kw = { # Keyword args for the Twitter API call
        'count': 200,
        'trim_user': 'false',
        'include_rts': 'true',
        'since_id': 1
    }

    if screen_name:
        kw['screen_name'] = screen_name
    else:
        kw['user_id'] = user_id

    max_pages = 16
    results = []

    tweets = make_twitter_request(t.statuses.user_timeline, **kw)

    if tweets is None: # 401 (Not Authorized) - need to bail out on loop entry
        tweets = []

    results += tweets

    print('Fetched %i tweets from @%s...' % (len(tweets), screen_name))

    page_num = 1

    # Many Twitter accounts have fewer than 200 tweets, so you don't want to enter
    # the loop and waste a precious request if max_results = 200.
    #
    # Note: Analogous optimizations could be applied inside the loop to try and
    # save requests, e.g. don't make a third request if you have 287 tweets out of
    # a possible 400 after your second request. Twitter does some post-filtering
    # on censored and deleted tweets out of batches of 'count', though, so you
    # can't strictly check for the number of results being 200. You might get
    # back 198, for example, and still have many more tweets to go. If you have
    # the total number of tweets for an account (via GET /users/lookup), then you
    # could simply use that value as a guide.

    if max_results == kw['count']:
        page_num = max_pages # Prevent loop entry

    while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:

        # Necessary for traversing the timeline in Twitter's v1.1 API:
        # get the next query's max_id parameter to pass in.
        # See https://dev.twitter.com/docs/working-with-timelines.
        kw['max_id'] = min([tweet['id'] for tweet in tweets]) - 1

        tweets = make_twitter_request(t.statuses.user_timeline, **kw)
        results += tweets

        print('Fetched %i tweets from @%s...' % (len(tweets), screen_name))

        page_num += 1

    print('Done! We fetched %i tweets from @%s' % (len(results), screen_name))

    return results[:max_results]
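The comments above suggest using an account's total tweet count as a guide. A minimal sketch of that idea via GET /users/lookup (again with an example handle):
In [ ]:
# Cap max_results at the account's actual tweet count; the v1.1 API returns
# at most the ~3200 most recent tweets per user regardless.
profile = make_twitter_request(t.users.lookup, screen_name='UNICEFIndia')
if profile:
    total = profile[0]['statuses_count']
    recent = get_user_tweets(t, screen_name='UNICEFIndia',
                             max_results=min(total, 3200))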
In [19]:
tweets = []
for username in usernames:
    try:
        # 3200 is the most the v1.1 API will return per user
        data = get_user_tweets(t, screen_name=username, max_results=3200)
        tweets.extend(data)
    except Exception as e:
        print >> sys.stderr, 'Skipping @%s: %s' % (username, e)
In [20]:
len(tweets)
Out[20]:
In [21]:
with open(jsonpath, 'wb') as tweetsfile:
    json.dump(tweets, tweetsfile) # Write tweets to the JSON file
In [23]:
with open(picklepath, "wb") as tweetsfile:
    pickle.dump(tweets, tweetsfile) # Write tweets to the pickle file
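Either file can be read back in later to pick up where this notebook leaves off; a quick round-trip check:
In [ ]:
# Confirm both files round-trip cleanly
with open(jsonpath) as f:
    assert len(json.load(f)) == len(tweets)
with open(picklepath, 'rb') as f:
    assert len(pickle.load(f)) == len(tweets)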